AIML Module Project - UNSUPERVISED LEARNING


Importing Required Python Modules and Libraries

Here we import all the libraries and modules needed for the whole project in a single cell.
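A single import cell for a project like this might look as follows; the exact library set is an assumption based on the steps that appear later (pandas, scikit-learn, SciPy, seaborn):

```python
# Core data handling
import warnings

import numpy as np
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning and statistics
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Keep notebook output free of deprecation noise
warnings.filterwarnings("ignore")
```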


Part ONE - Project Based


DOMAIN: Automobile

CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

DATA DESCRIPTION: The data concerns city-cycle fuel consumption in miles per gallon.

PROJECT OBJECTIVE: The goal is to cluster the data, treat each cluster as an individual dataset, and train regression models on each to predict ‘mpg’.

Steps and tasks:

  1. Import and warehouse data:
    • Import all the given datasets and explore the shape and size of each.
    • Merge all datasets into one and explore the final shape and size.
    • Export the final dataset and store it on the local machine in .csv, .xlsx and .json formats for future use.
    • Import the data from the above steps back into Python.
  2. Data cleansing:
    • Missing/incorrect value treatment.
    • Drop attribute/s if required, using relevant functional knowledge.
    • Perform any other corrections/treatment on the data.
  3. Data analysis & visualisation:

    • Perform detailed statistical analysis on the data.
    • Perform a detailed univariate, bivariate and multivariate analysis, with appropriate detailed comments after each analysis.

      Hint: Use your best analytical approach. You can even mix and match columns to create new columns for better analysis. Create your own features if required. Be highly experimental and analytical here to find hidden patterns.

  4. Machine learning:
    • Use K-Means and hierarchical clustering to find the optimal number of clusters in the data.
    • Share your insights about the differences between these two methods.
  5. Answer the questions below based on the outcomes of the ML-based methods.
    • Mention how many optimal clusters are present in the data and what the possible reason behind it could be.
    • Fit a linear regression model on each cluster separately and print the coefficients of each model individually.
    • How is using different models for different clusters helpful in this case, and how does it differ from using one single model without clustering? Mention how it impacts performance and prediction.
  6. Improvisation:
    • Detailed suggestions or improvements on the quality, quantity, variety, velocity, veracity, etc. of the data points collected by the company, to enable better data analysis in future.

1. Import and Warehouse Data:

* Import all the given Datasets and Explore Shape and Size of each.


Key Observations:-


* Merge all Datasets onto One and Explore Final Shape and Size.
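As a sketch of the merge step (the file split shown here is hypothetical), row-wise concatenation with pandas rebuilds one dataset:

```python
import pandas as pd

# Hypothetical example: the given data arrives split across several parts
df_a = pd.DataFrame({"mpg": [18.0, 15.0], "cyl": [8, 8]})
df_b = pd.DataFrame({"mpg": [26.0, 32.0], "cyl": [4, 4]})

# Row-wise concatenation; ignore_index rebuilds a clean 0..n-1 index
merged = pd.concat([df_a, df_b], ignore_index=True)
print(merged.shape)  # (4, 2)
```

`pd.concat` with `ignore_index=True` fits parts that share the same columns; `pd.merge` would be used instead if the parts shared a key column.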


Key Observations:-


* Export the Final Dataset and Store it on Local Machine in .csv, .xlsx and .json format for Future Use.
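The export and re-import steps can be sketched as below; the file names and the tiny stand-in frame are assumptions (writing .xlsx additionally requires an Excel engine such as openpyxl, so that line is left commented):

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"mpg": [18.0, 26.0], "cyl": [8, 4]})  # stand-in for the merged data

out = Path(tempfile.mkdtemp())  # local folder for the exported copies
df.to_csv(out / "cars.csv", index=False)
df.to_json(out / "cars.json", orient="records")
# df.to_excel(out / "cars.xlsx", index=False)  # needs an Excel engine such as openpyxl

# Re-import (the "import the data from the above steps" task)
back = pd.read_csv(out / "cars.csv")
print(back.shape)  # (2, 2)
```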


Key Observations:-


* Import the Data from the Above Steps into Python.


Key Observations:-


2. Data Cleansing:

* Missing/incorrect value treatment

Checking for Null Values in the Attributes


Key Observations:-


* Drop attribute/s if required using relevant functional knowledge


Key Observations:-


* Perform another kind of corrections/treatment on the data.

Checking for Dirty Values in Dataset values


Key Observations:-


Replacing Dirty Values of the Attribute by 0


Key Observations:-


Checking Datatypes of Attributes


Key Observations:-


Converting 'hp' Attribute datatype to integer


Key Observations:-


Replacing 0 by Mean value of hp Attribute
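The whole treatment chain above (replace the dirty marker with 0, cast to integer, then impute the 0 placeholders with the mean) can be sketched as follows; the '?' marker mirrors the classic auto-mpg data and is an assumption here:

```python
import pandas as pd

# Hypothetical 'hp' column containing a dirty '?' entry
df = pd.DataFrame({"hp": ["130", "?", "150", "140"]})

df["hp"] = df["hp"].replace("?", 0)           # replace dirty values by 0
df["hp"] = df["hp"].astype(int)               # fix the datatype
mean_hp = df.loc[df["hp"] != 0, "hp"].mean()  # mean computed without the 0 placeholders
df.loc[df["hp"] == 0, "hp"] = int(mean_hp)
print(df["hp"].tolist())  # [130, 140, 150, 140]
```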


Key Observations:-


Comparing Before and After the Correction/Treatment Process


Key Observations:-


3. Data Analysis & Visualisation:

* Perform Detailed Statistical Analysis on the Data.

Brief Summary of Data

Checking skewness of the data attributes

Checking Variance of the data attributes

Getting Interquartile Range of data attributes

Checking Correlation by plotting Heatmap for attributes


Key Observations:-


* Perform a Detailed Univariate, Bivariate and Multivariate Analysis with Appropriate detailed comments after Each Analysis.

Univariate Analysis

Univariate analysis is the simplest form of analyzing data. It involves only one variable.

Creating Functions for Plotting the Quantitative and Categorical Data for Univariate Analysis.

We will use these functions for easy analysis of individual attribute.

Attribute 1: "mpg"

Attribute 2: "cyl"

Attribute 3: "disp"

Attribute 4: "hp"

Attribute 5: "wt"

Attribute 6: "acc"

Attribute 7: "yr"

Attribute 8: "origin"

Bivariate Analysis

Creating Functions for Plotting the Quantitative VS Categorical Data for Bivariate Analysis

Bivariate Analysis 1: cyl VS All Quantitative Attributes

Bivariate Analysis 2: yr VS All Quantitative Attributes

Bivariate Analysis 3: origin VS All Quantitative Attributes

Multivariate Analysis

Multivariate analysis is performed to understand interactions between different fields in the dataset.

Multivariate Analysis : To Check Relation Between Attributes

Multivariate Analysis : To Check Correlation


Outlier Analysis

NOTE:- Here we replace outliers with the mean of the attribute computed without the outliers. That is, we first calculate the mean excluding the outliers and then replace the outliers with this calculated mean.
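A minimal sketch of this rule, using the usual 1.5 × IQR fence (the fence choice is an assumption):

```python
import pandas as pd

def replace_outliers_with_mean(s: pd.Series) -> pd.Series:
    """Replace IQR-outliers by the mean of the non-outlier values."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = (s < lo) | (s > hi)
    clean_mean = s[~mask].mean()   # mean computed WITHOUT the outliers
    return s.where(~mask, clean_mean)

s = pd.Series([10, 11, 12, 13, 100])  # 100 is an obvious outlier
print(replace_outliers_with_mean(s).tolist())  # [10.0, 11.0, 12.0, 13.0, 11.5]
```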


Key Observations:-


Feature Scaling Standardization(Z-Score Normalization)

Scaling is needed for our data, so we scale all numeric features with Z-score normalization. After standardizing the dataset values, we get the following statistics of the data distribution:
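The standardization step might be sketched like this with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_scaled = StandardScaler().fit_transform(X)

# After standardization each column has mean ~0 and std ~1
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```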


Key Observations:-


4. Machine learning:

* Use K Means and Hierarchical clustering to find out the optimal number of clusters in the data.

K Means Clustering

Finding the optimal number of Clusters
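The elbow method for choosing K can be sketched on synthetic two-blob data, where the drop in inertia flattens after K = 2 (the synthetic data is a stand-in, not the project data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs, so the "elbow" should appear at K = 2
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases with K; the elbow is where the drop flattens
print([round(i, 1) for i in inertias])
```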


Key Observations:-


Fitting the Model with K = 2


Key Observations:-


Analyze the distribution of the data among the Groups


Key Observations:-


Hierarchical Clustering


Key Observations:-


Plotting Dendrogram
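A sketch of the hierarchical side using SciPy: build the Ward merge tree, cut it into two clusters, and (in a notebook) plot the dendrogram from the same linkage matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(4, 0.3, (10, 2))])

Z = linkage(X, method="ward")                    # agglomerative merge tree
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(sorted(set(labels)))                       # [1, 2]

# For the plot itself:
# from scipy.cluster.hierarchy import dendrogram
# import matplotlib.pyplot as plt
# dendrogram(Z); plt.show()
```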


Key Observations:-


* Share your insights about the difference in using these two methods.

5. Answer below questions based on outcomes of using ML based methods.

* Mention how many optimal clusters are present in the data and what could be the possible reason behind it.

* Fit a linear regression model on each cluster separately and print the coefficients of each model individually.

Separating the Two Clusters and Segregating Predictor VS Target Attributes


Key Observations:-


Fitting Linear Regression Model and Getting Coefficients
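Per-cluster regression might be sketched as below; the data and the cluster labels are synthetic stand-ins with known coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5         # known coefficients for the sketch
cluster_labels = np.array([0] * 20 + [1] * 20)  # stand-in for the K-Means output

for c in (0, 1):
    mask = cluster_labels == c
    model = LinearRegression().fit(X[mask], y[mask])
    print(f"cluster {c}: coef={model.coef_.round(2)}, intercept={model.intercept_:.2f}")
```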


Key Observations:-


* How is using different models for different clusters helpful in this case, and how does it differ from using one single model without clustering? Mention how it impacts performance and prediction.

6. Conclusion and Improvisation:

* Detailed suggestions or improvements or on quality, quantity, variety, velocity, veracity etc. on the data points collected by the company to perform a better data analysis in future.

Closing Sentence:- Clustering of the data is done. Each cluster is treated as an individual dataset and a regression model is trained on each.

-------------------------------------------------- End of Part ONE-------------------------------------------------------


Part TWO - Project Based


DOMAIN: Manufacturing

CONTEXT: Company X curates and packages wine across various vineyards spread throughout the country.

DATA DESCRIPTION: The data concerns the chemical composition of the wine and its respective quality. Attribute Information:

  1. A, B, C, D: specific chemical composition measure of the wine
  2. Quality: quality of wine

PROJECT OBJECTIVE: The goal is to build a synthetic data generation model using the existing data provided by the company.

Steps and tasks: Design a synthetic data generation model which can impute values [Attribute: Quality] wherever the entry is empty, i.e. where the company has missed recording the data.

Exploring Data


Key Observations:-


Checking the Number of Empty Values in the Quality Attribute


Key Observations:-


Scaling and Finding Optimal K Value


Key Observations:-


Fitting K-Means Model and Getting Clusters
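One way to realise this step, together with the quality-replacement that follows, is sketched below; the columns A and B, the label values, and the two-blob structure are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Hypothetical stand-in: two chemical profiles with a few missing Quality values
df = pd.DataFrame(rng.normal(0, 0.3, (20, 2)), columns=["A", "B"])
df.loc[10:, ["A", "B"]] += 4.0
df["Quality"] = ["low"] * 10 + ["high"] * 10
df.loc[[3, 15], "Quality"] = np.nan  # rows the company missed recording

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df[["A", "B"]])
df["cluster"] = km.labels_

# Map each cluster to the majority Quality among its recorded rows,
# then fill the gaps with that mapped value
mapping = (df.dropna(subset=["Quality"])
             .groupby("cluster")["Quality"]
             .agg(lambda s: s.mode()[0]))
df["Quality"] = df["Quality"].fillna(df["cluster"].map(mapping))
print(df["Quality"].isna().sum())  # 0
```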


Key Observations:-


Compare the clusters with the Existing Target


Key Observations:-


Replacing 1s and 0s with respective Qualities to the Dataset


Key Observations:-


Checking the Number of Empty Values After Replacement


Key Observations:-


Displaying Final Outcome

Closing Sentence:- A synthetic data generation model is built using the existing data provided by the company.

-------------------------------------------------- End of Part TWO-------------------------------------------------------


Part THREE - Project Based


DOMAIN: Automobile

CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

DATA DESCRIPTION: The data contains features extracted from the silhouettes of vehicles at different angles. Four "Corgie" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the cars.

PROJECT OBJECTIVE: Apply dimensionality reduction technique – PCA and train a model using principal components instead of training the model using just the raw data.

Steps and tasks:

  1. Data: Import, clean and pre-process the data.
  2. EDA and visualisation: Create a detailed performance report using univariate, bivariate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.

    For example: Use your best analytical approach to build this report. You can even mix and match columns to create new columns for better analysis. Create your own features if required. Be highly experimental and analytical here to find hidden patterns.

  3. Classifier: Design and train a best-fit SVM classifier using all the data attributes.
  4. Dimensional reduction: Perform dimensional reduction on the data.
  5. Classifier: Design and train a best-fit SVM classifier using the dimensionally reduced attributes.
  6. Conclusion: Showcase key pointers on how dimensional reduction helped in this case.

1. Data: Import, Clean and Pre-Process the Data

Exploring Dataset


Key Observations:-


Checking for Null Values in the Attributes


Key Observations:-

Dropping Null Values


Key Observations:-


Checking the Datatypes of Each Attribute.


Key Observations:-


Outlier Analysis

NOTE:- Here we replace outliers with the mean of the attribute computed without the outliers. That is, we first calculate the mean excluding the outliers and then replace the outliers with this calculated mean.


Key Observations:-


2. EDA and visualisation: Create a detailed performance report using univariate, bivariate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.

Brief Summary of Data

Checking Skewness of the data attributes

Checking Variance of the data attributes


Key Observations:-


Univariate Analysis


Key Observations:-


Bivariate Analysis


Multivariate Analysis

Multivariate analysis is performed to understand interactions between different fields in the dataset.

Multivariate Analysis : To Check Relation Between Attributes

Multivariate Analysis : To Check Correlation


Key Observations:-


3. Classifier: Design and train a best-fit SVM classifier using all the data attributes.

Segregating Predictors VS Target Attributes and Scaling by zscores


Key Observations:-


Check for Target Balancing


Key Observations:-


Fixing Target Imbalance by Synthetic Minority Oversampling Technique (SMOTE)

SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
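The description above can be mirrored in a few lines of NumPy. In practice one would use a library implementation such as imbalanced-learn's SMOTE; this is only a sketch of the convex-combination idea, with hypothetical minority points:

```python
import numpy as np

def smote_like_sample(minority: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Generate synthetic minority samples as convex combinations of a point
    and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        a = minority[i]
        # k nearest neighbours of a within the minority class (excluding a itself)
        d = np.linalg.norm(minority - a, axis=1)
        nn = np.argsort(d)[1:k + 1]
        b = minority[rng.choice(nn)]
        lam = rng.random()             # position on the segment a-b
        out.append(a + lam * (b - a))
    return np.array(out)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synth = smote_like_sample(minority, n_new=5)
print(synth.shape)  # (5, 2)
```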


Key Observations:-


Performing Train-Test Split.


Key Observations:-


Fitting SVM Model and Getting Accuracy
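The fit-and-score step might be sketched as follows on a synthetic three-class stand-in for the scaled silhouette features (the kernel and C value are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the scaled silhouette features (3 classes)
X, y = make_classification(n_samples=300, n_features=8, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

svc = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
acc = accuracy_score(y_test, svc.predict(X_test))
print(round(acc, 3))
```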


Key Observations:-


4. Dimensional reduction: Perform Dimensional Reduction on the Data.

Fitting PCA

Plotting Eigen Value to get dimension
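The component-count choice can be sketched by reading the cumulative explained-variance curve; the synthetic data and the 95% threshold are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# 100 samples, 10 features, but almost all variance lives in 2 directions
X = 0.01 * rng.normal(size=(100, 10))
X[:, 0] += rng.normal(0, 5, size=100)
X[:, 1] += rng.normal(0, 3, size=100)

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that explains at least 95% of the variance
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components)  # 2
```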


Key Observations:-


Fitting PCA with 8 dimensions and Transforming Predictors


Key Observations:-


5. Classifier: Design and train a best-fit SVM classifier using the Dimensionally Reduced attributes

Splitting Data and Fitting SVM to get Accuracies

6. Conclusion: Showcase key pointer on how dimensional reduction helped in this case.

Closing Sentence:- The dimensionality reduction technique (PCA) is implemented, and the model is trained using principal components instead of just the raw data.

-------------------------------------------------- End of Part THREE -------------------------------------------------------


Part FOUR - Project Based


DOMAIN: Sports management

CONTEXT: Company X is a sports management company for international cricket.

DATA DESCRIPTION: The data collected belongs to batsmen from the IPL series conducted so far. Attribute Information:

  1. Runs: runs scored by the batsman
  2. Ave: average runs scored by the batsman per match
  3. SR: strike rate of the batsman
  4. Fours: number of boundary fours scored
  5. Six: number of boundary sixes scored
  6. HF: number of half-centuries scored so far

PROJECT OBJECTIVE: The goal is to build a data-driven batsman ranking model for the sports management company to support business decisions.

Steps and tasks:

  1. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.
  2. Build a data driven model to rank all the players in the dataset using all or the most important performance features.

Exploring Dataset


Key Observations:-


Checking for Null Values in the Attributes


Key Observations:-


Dropping Null Values


Key Observations:-


1. EDA and Visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.

Brief Summary of Data

Checking Skewness of the data attributes

Checking Variance of the data attributes

Checking Correlation by plotting Heatmap for attributes


Key Observations:-


Univariate Analysis

Univariate analysis is the simplest form of analyzing data. It involves only one variable.

Creating Functions for Plotting the Data for Univariate Analysis.

We will use these functions for easy analysis of individual attribute.

Attribute 1: "Runs"

Attribute 2: "Ave"

Attribute 3: "SR"

Attribute 4: "Fours"

Attribute 5: "Sixes"

Attribute 6: "HF"

Bivariate Analysis

Multivariate Analysis

Multivariate analysis is performed to understand interactions between different fields in the dataset.

Multivariate Analysis : To Check Relation Between Attributes

Multivariate Analysis : To check Density of Categorical Attribute in all other Attributes

Multivariate Analysis : To Check Correlation


Outlier Analysis

NOTE:- Here we replace outliers with the mean of the attribute computed without the outliers. That is, we first calculate the mean excluding the outliers and then replace the outliers with this calculated mean.


Key Observations:-


2. Build a data driven model to rank all the players in the dataset using all or the most important performance features

Fitting PCA

Plotting Eigen Value to get dimension


Key Observations:-


Fitting PCA with 4 dimensions and Transforming Predictors


Key Observations:-


Getting index values of transformed data after sorting them
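The sorting idea can be sketched on a hypothetical mini-dataset: scale the performance features, project onto the first principal component, orient its sign, and sort by the resulting score (all names and values below are illustrative stand-ins):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical mini-dataset with a few of the documented attributes
df = pd.DataFrame({
    "Player": ["P1", "P2", "P3"],
    "Runs":  [4500, 1200, 3000],
    "Ave":   [38.0, 22.0, 31.0],
    "SR":    [135.0, 110.0, 128.0],
})

X = StandardScaler().fit_transform(df[["Runs", "Ave", "SR"]])
score = PCA(n_components=1).fit_transform(X).ravel()

# Orient the score so that higher raw performance means a higher score,
# then rank by sorting on it
if np.corrcoef(score, df["Runs"])[0, 1] < 0:
    score = -score
ranking = df.loc[np.argsort(-score), "Player"].tolist()
print(ranking)  # ['P1', 'P3', 'P2']
```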


Key Observations:-


Re-ordering Data Based on Index


Key Observations:-


Finalised Sorted Players


Key Observations:-


Closing Sentence:- A data-driven batsman ranking model, based on performance, is built for the sports management company to support business decisions.

-------------------------------------------------- End of Part FOUR -------------------------------------------------------


Part FIVE - Question Based


Questions:

  1. List down all possible dimensionality reduction techniques that can be implemented using Python.
  2. So far you have used dimensional reduction on numeric data. Is it possible to do the same on multimedia data [images and videos] and text data? Please illustrate your findings using a simple implementation in Python.

1. List down all possible dimensionality reduction techniques that can be implemented using Python.

Dimensionality Reduction Techniques:-

1. Principal Component Analysis: PCA is a technique which helps us in extracting a new set of variables from an existing large set of variables. These newly extracted variables are called Principal Components.

2. Random Forest: Random Forest is one of the most widely used algorithms for feature selection. It comes packaged with in-built feature importance so you don’t need to program that separately. This helps us select a smaller subset of features.

3. Missing Value Ratio: Feature Selection plays a key role in reducing the dimensions of any dataset. There are various benefits of dimensionality reduction including reduced computational/training time of a dataset, lesser dimensions lead to better visualization, etc. And Missing Value Ratio is one of the basic feature selection techniques.

4. Low Variance Filter: Low Variance Filter is a useful dimensionality reduction algorithm. The variance is a statistical measure of the amount of variation in the given variable. If the variance is too low, it means that it does not change much and hence it can be ignored.

5. High Correlation Filter: This dimensionality reduction algorithm tries to discard inputs that are very similar to others. If there is a very high correlation between two input variables, we can safely drop one of them.

6. Forward Feature Selection: Forward feature selection starts by evaluating each individual feature and selecting the one that yields the best model performance; features are then added one at a time as long as performance improves.

7. Backward Feature Elimination: Backward elimination is a feature selection technique while building a machine learning model. It is used to remove those features that do not have a significant effect on the dependent variable or prediction of output.

8. Independent Component Analysis: Independent Component Analysis (ICA) extracts hidden factors within data by transforming a set of variables to a new set that is maximally independent.

9. Factor Analysis: Factor analysis is a technique that is used to reduce a large number of variables into fewer numbers of factors. This technique extracts maximum common variance from all variables and puts them into a common score.

10. t-Distributed Stochastic Neighbor Embedding (t-SNE): t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.

11. Uniform Manifold Approximation and Projection (UMAP): UMAP is a dimension reduction technique that can preserve as much of the local, and more of the global, data structure as compared to t-SNE, with a shorter runtime.

12. Random Projection (RP): In RP, higher-dimensional data is projected onto a lower-dimensional subspace using a random matrix whose columns have unit length.

13. Singular Value Decomposition (SVD): SVD factorizes a matrix into singular vectors and singular values; truncating to the top-k singular values yields a low-rank approximation of the data, reducing the number of input variables while retaining most of the structure.

14. Latent Semantic Analysis (LSA): LSA learns latent topics by performing a matrix decomposition on the document-term matrix using singular value decomposition. It is typically used as a dimension-reduction or noise-reduction technique.


2. So far you have used dimensional reduction on numeric data. Is it possible to do the same on multimedia data [images and videos] and text data? Please illustrate your findings using a simple implementation in Python.

Let's illustrate this using the PCA dimension reduction technique.

Implementation of Dimensional Reduction on an Image:-

Loading the Image


Key Observations:-


Checking Shape and Dimension of the Image


Key Observations:-


Converting the Image to 2 Dimension by Re-shaping


Key Observations:-


Applying PCA to Compress the Image
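PCA compression of an image can be sketched by treating rows as samples and columns as features; the smooth gradient image below is a synthetic stand-in for a loaded grayscale image, and the component count (8) is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a grayscale image: a smooth 64x64 gradient pattern
x = np.linspace(0, 1, 64)
img = np.outer(x, x) + 0.1 * np.sin(10 * np.outer(x, np.ones(64)))

# Treat rows as samples, columns as features, keep a few components
pca = PCA(n_components=8)
compressed = pca.fit_transform(img)            # 64 x 8 representation
reconstructed = pca.inverse_transform(compressed)

print(img.shape, compressed.shape)
print(round(float(pca.explained_variance_ratio_.sum()), 4))  # ~1.0 for this smooth image
```

For a natural photograph, more components would be needed, and the explained-variance sum quantifies how much image structure survives the compression.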


Key Observations:-


Comparing Original Image with PCA Compressed Image


Key Observations:-


Closing Sentence:- A dimensionality reduction technique has been illustrated on multimedia data (an image).

-------------------------------------------------- End of Part FIVE -------------------------------------------------------